Abstract

In this paper, we study randomized reduction methods, which reduce
high-dimensional features into low-dimensional space by randomized
methods (e.g., random projection, random hashing), for large-scale
high-dimensional classification. Previous theoretical results on
randomized reduction methods hinge on strong assumptions about the
data, e.g., low rank of the data matrix or a large separable margin of
classification, which hinder their applications in broad domains. To
address these limitations, we propose dual-sparse regularized
randomized reduction methods that introduce a sparse regularizer into
the reduced dual problem. Under a mild condition that the original dual
solution is a (nearly) sparse vector, we show that the resulting dual
solution is close to the original dual solution and concentrates on its
support set. In numerical experiments, we present an empirical study to
support the analysis and we also present a novel application of the
dual-sparse regularized randomized reduction methods to reducing the
communication cost of distributed learning from large-scale
high-dimensional data.

1. Introduction 

As the scale and dimensionality of data continue to grow in many
applications (e.g., bioinformatics, finance, computer vision, medical
informatics) (Sanchez et al., 2013; Mitchell et al., 2004; Simianer
et al., 2012; Bartz et al., 2011), it becomes critical to develop
efficient and effective algorithms to solve big data machine learning
problems. Randomized reduction methods for large-scale or
high-dimensional data analytics have received a great deal of attention
in recent years (Mahoney & Drineas, 2009; Shi et al., 2012; Paul et al.,
2013; Weinberger et al., 2009; Mahoney, 2011). By either reducing the
dimensionality (referred to as feature reduction) or reducing the number
of training instances (referred to as instance reduction), the resulting
problem has a smaller size of training data and is therefore both
memory-efficient and computation-efficient. While randomized instance
reduction has been studied extensively for fast least squares regression
(Drineas et al., 2008; 2006; 2011; Ma et al., 2014), randomized feature
reduction is more popular for linear classification (Blum, 2005; Shi
et al., 2012; Paul et al., 2013; Weinberger et al., 2009; Shi et al.,
2009a); for example, random hashing is a notable built-in tool in
Vowpal Wabbit, a fast learning library, for solving high-dimensional
problems. In this paper, we focus on the latter technique and refer to
randomized feature reduction as randomized reduction for short.

Although several theoretical properties have been examined for
randomized reduction methods when applied to classification, e.g.,
generalization performance (Paul et al., 2013), preservation of margin
(Blum, 2005; Balcan et al., 2006; Shi et al., 2012) and the recovery
error of the model (Zhang et al., 2014), these previous results rely on
strong assumptions about the data. For example, both (Paul et al., 2013)
and (Zhang et al., 2014) assume the data matrix is of low rank, and
(Blum, 2005; Balcan et al., 2006; Shi et al., 2012) make an assumption
that all examples in the original space are separated with a positive
margin (with a high probability). Another analysis in (Zhang et al.,
2014) assumes the weight vector for classification is sparse. These
assumptions are too strong to hold in many real applications.

Contributions. To address these limitations, we propose dual-sparse
regularized randomized reduction methods, referred to as DSRR, by
leveraging the (near) sparsity of dual solutions for large-scale
high-dimensional (LSHD) classification problems (i.e., the number of
(effective) support vectors is small compared to the total number of
examples). In particular, we add a dual-sparse regularizer into the
reduced dual problem. We present a novel theoretical analysis of the
recovery error of the dual variables and the primal variable and study
its implication for different randomized reduction methods (e.g., random
projection, random hashing and random sampling).

Novelties. Compared with previous works (Blum, 2005; Balcan et al.,
2006; Shi et al., 2012; Paul et al., 2013), our theoretical analysis
demands only a mild assumption about the data and directly provides a
guarantee on a small recovery error of the obtained model, which is
critical for subsequent analysis, e.g., feature selection (Guyon et al.,
2002; Brank et al., 2002) and model interpretation (Rätsch et al., 2005;
Sonnenburg & Franc, 2010; Rätsch et al., 2005; Sonnenburg et al., 2007;
Ben-Hur et al., 2008). For example, when exploiting a linear model to
classify people into sick or not sick based on genomic markers, the
learned weight vector is important for understanding the effect of
different genomic markers on the disease and for designing effective
medicine (Jostins & Barrett, 2011; Kang & Cho, 2011). In addition, the
recovery could also increase the predictive performance, in particular
when there exists noise in the original features (Goldberger et al.,
2005).

Compared with (Zhang et al., 2014), which proposes to recover a linear
model in the original feature space by dual recovery, i.e., constructing
a weight vector using the dual variables learned from the reduced
problem and the original feature vectors, our methods are better in that
(i) we rely on a more realistic assumption of the sparsity of dual
variables (e.g., in support vector machine (SVM)); (ii) we analyze both
smooth loss functions and non-smooth loss functions (they focused on
smooth functions); (iii) we study different randomized reduction methods
in the same framework, not just random projection.

In numerical experiments, we present an empirical study on a real data
set to support our analysis and we also demonstrate a novel application
of the reduction and recovery framework in distributed learning from
LSHD data, which combines the benefits of the two complementary
techniques for addressing big data problems. Distributed
learning/optimization has recently received significant interest in
solving big data problems (Jaggi et al., 2014; Li et al., 2014; Yang,
2013; Agarwal et al., 2011). However, it is notorious for high
communication cost, especially when the dimensionality of data is very
high. By solving a dimensionality reduced data problem and using the
recovered solution as an initial solution to the distributed
optimization on the original data, we can reduce the number of
iterations and the communication cost. In practice, we employ the
recently developed distributed stochastic dual coordinate ascent
algorithm (Yang, 2013), and observe that using the recovered solution as
an initial solution we are able to attain almost the same performance
with only one or two communications of high-dimensional vectors among
multiple machines.

2. Preliminaries 

Let $(x_i, y_i)$, $i = 1, \ldots, n$ denote a set of training examples,
where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, -1\}$. Assume both $n$ and
$d$ are very large. The goal of classification is to solve the following
optimization problem:

$$w_* = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i y_i) + \frac{\lambda}{2}\|w\|_2^2 \tag{1}$$

where $\ell(zy)$ is a convex loss function and $\lambda$ is a
regularization parameter. Using the conjugate function, we can turn the
problem into a dual problem:

$$\alpha_* = \arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top X^\top X \alpha \tag{2}$$

where $X = (x_1, \ldots, x_n)$ is the data matrix and $\ell_i^*(\alpha)$
is the convex conjugate function of $\ell(z y_i)$. Given the optimal
dual solution $\alpha_*$, the optimal primal solution can be computed by
$w_* = -\frac{1}{\lambda n} X \alpha_*$. For LSHD problems, directly
solving the primal problem (1) or the dual problem (2) could be very
expensive. We aim to address the challenge by randomized reduction
methods. Let $A(\cdot): \mathbb{R}^d \to \mathbb{R}^m$ denote a
randomized reduction operator that reduces a $d$-dimensional feature
vector into an $m$-dimensional feature vector. Let $\widehat{x} = A(x)$
denote the reduced feature vector. With the reduced feature vectors
$\widehat{x}_1, \ldots, \widehat{x}_n$ of the training examples, a
conventional approach is to solve the following reduced primal problem

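$$u_* = \arg\min_{u \in \mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell(u^\top \widehat{x}_i y_i) + \frac{\lambda}{2}\|u\|_2^2 \tag{3}$$

or its dual problem

$$\widehat{\alpha}_* = \arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widehat{X}^\top \widehat{X} \alpha \tag{4}$$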

where $\widehat{X} = (\widehat{x}_1, \ldots, \widehat{x}_n) \in
\mathbb{R}^{m \times n}$. Previous studies have analyzed the reduced
problems for random projection methods and proved the preservation of
margin (Blum, 2005; Shi et al., 2012) and the preservation of minimum
enclosing ball (Paul et al., 2013). Zhang et al. (2014) proposed a dual
recovery approach that constructs a recovered solution by
$\widehat{w}_* = -\frac{1}{\lambda n} X \widehat{\alpha}_*$ and bounded
the recovery error for random projection under the assumption of a
low-rank data matrix or a sparse $w_*$. In addition, they also showed
that the naive recovery by $A^\top u_*$ (when $A(x) = Ax$) has a large
recovery error.

One deficiency with the simple dual recovery approach is that due to the
reduction in the feature space, many non-support vectors for the
original optimization problem will become support vectors, which could
result in the corruption in the recovery error. As a result, the
original analysis of dual recovery method requires a strong assumption
of data (i.e., the low rank assumption). In this work, we plan to
address this limitation in a different way, which allows us to relax the
assumption significantly.


3. DSRR and its Guarantee

To reduce the number of or the contribution of training instances that
are non-support vectors in the original optimization problem and are
transformed into support vectors due to the reduction of the feature
space, we employ a simple trick that adds a dual-sparse regularization
to the reduced dual problem. In particular, we solve the following
problem:

$$\widetilde{\alpha}_* = \arg\max_{\alpha \in \mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\alpha^\top \widehat{X}^\top \widehat{X} \alpha - \frac{1}{n}R(\alpha) \tag{5}$$

where $R(\alpha) = \tau\|\alpha\|_1$, and $\tau > 0$ is a regularization
parameter, whose theoretical value will be revealed later.

To further understand the added dual-sparse regularizer, we consider
SVM, where the loss function can be either the hinge loss (a non-smooth
function) $\ell(zy) = \max(0, 1 - zy)$ or the squared hinge loss (a
smooth function) $\ell(zy) = \max(0, 1 - zy)^2$. We first consider the
hinge loss, where $\ell_i^*(\alpha_i) = \alpha_i y_i$ for
$\alpha_i y_i \in [-1, 0]$. Then the new dual problem is equivalent to

$$\max_{\alpha \circ y \in [-1, 0]^n} \frac{1}{n}\sum_{i=1}^n -\alpha_i y_i - \frac{1}{2\lambda n^2}\alpha^\top \widehat{X}^\top \widehat{X} \alpha - \frac{\tau}{n}\|\alpha\|_1.$$

Using the variable transformation $-\alpha_i y_i \to \beta_i$, the above
problem is equivalent to

$$\max_{\beta \in [0, 1]^n} \frac{1}{n}\sum_{i=1}^n \beta_i (1 - \tau) - \frac{1}{2\lambda n^2}(\beta \circ y)^\top \widehat{X}^\top \widehat{X} (\beta \circ y).$$

Changing into the primal form, we have

$$\min_{u \in \mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell_{1-\tau}(u^\top \widehat{x}_i y_i) + \frac{\lambda}{2}\|u\|_2^2 \tag{6}$$

where $\ell_\gamma(z) = \max(0, \gamma - z)$ is a max-margin loss with
margin given by $\gamma$. It can be understood that adding the $\ell_1$
regularization in the reduced problem of SVM is equivalent to using a
max-margin loss with a smaller margin, which is intuitive because
examples become difficult to separate after dimensionality reduction and
is consistent with several previous studies showing that the margin is
reduced in the reduced feature space (Blum, 2005; Shi et al., 2012).
Similarly for squared hinge loss, the equivalent primal problem is

$$\min_{u \in \mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell^2_{1-\tau}(u^\top \widehat{x}_i y_i) + \frac{\lambda}{2}\|u\|_2^2 \tag{7}$$

where $\ell^2_\gamma(z) = \max(0, \gamma - z)^2$.

Although adding a dual-sparse regularizer is intuitive and can be
motivated from previous results, we emphasize that the proposed
dual-sparse formulation provides a new perspective and bounding the dual
recovery error $\|\widetilde{\alpha}_* - \alpha_*\|$ is a non-trivial
task, which is a major contribution of this paper. We first state our
main result in Theorem 1 for smooth loss functions.

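Theorem 1. Let $\widetilde{\alpha}_*$ be the optimal dual solution to
(5). Assume $\alpha_*$ is $s$-sparse with the support set given by $S$.
If $\tau \ge \frac{2}{\lambda n}\|(X^\top X - \widehat{X}^\top \widehat{X})\alpha_*\|_\infty$,
then we have

$$\|[\widetilde{\alpha}_*]_{S^c}\|_1 \le 3\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1 \tag{8}$$

Furthermore, if $\ell(z)$ is an $L$-smooth loss function², we have

$$\|\widetilde{\alpha}_* - \alpha_*\|_2 \le 3\tau L\sqrt{s}, \qquad \|\widetilde{\alpha}_* - \alpha_*\|_1 \le 12\tau L s \tag{9}$$

$$\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1 \le 3\tau L s, \qquad \|[\widetilde{\alpha}_*]_{S^c}\|_1 \le 9\tau L s \tag{10}$$

where $S^c$ is the complement of $S$, and $[\alpha]_S$ is a vector that
only contains the elements of $\alpha$ in the set $S$.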

Remark 1: The proof is presented in Appendix A. It can be seen that the
dual recovery error is proportional to the value of $\tau$, which
depends on $\|(X^\top X - \widehat{X}^\top \widehat{X})\alpha_*\|_\infty$,
a quantity we can bound without using any assumption about the data
matrix or the optimal dual variable $\alpha_*$. In contrast, previous
bounds (Zhang et al., 2013; 2014; Paul et al., 2013) depend on
$\|X^\top X - \widehat{X}^\top \widehat{X}\|_2$, which requires the low
rank assumption on $X$. In the next section, we provide an upper bound
of $\frac{1}{\lambda n}\|(X^\top X - \widehat{X}^\top \widehat{X})\alpha_*\|_\infty$
that will allow us to understand how the reduced dimensionality $m$
affects the recovery error. Essentially, the results indicate that for
random projection, randomized Hadamard transform and random hashing,
$\frac{1}{\lambda n}\|(X^\top X - \widehat{X}^\top \widehat{X})\alpha_*\|_\infty \le O\left(\sqrt{\frac{\log(n/\delta)}{m}}\right)\|w_*\|_2$
with a probability $1 - \delta$, and thus the recovery error will be
scaled as $\sqrt{1/m}$ in terms of $m$, the same order of recovery error
as in (Zhang et al., 2013; 2014), which assumes low rank of the data
matrix.

Remark 2: We would like to make a connection with LASSO for sparse
signal recovery. In sparse signal recovery under noisy measurements
$f = U w_* + e$, where $e$ denotes the noise in measurements, if a LASSO
$\min_w \frac{1}{2}\|Uw - f\|_2^2 + \lambda\|w\|_1$ is solved for the
solution, then the regularization parameter $\lambda$ is required to be
larger than the quantity $\|U^\top e\|_\infty$ that depends on the noise
in order to have an accurate recovery (Eldar & Kutyniok, 2012).
Similarly in our formulation, the added $\ell_1$ regularization
$\tau\|\alpha\|_1$ is to counteract the noise in
$\widehat{X}^\top \widehat{X}$ as compared with $X^\top X$, and the
value of $\tau$ is dependent on the noise.

² A function is L-smooth if its gradient is L-Lipschitz continuous.
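To make formulation (5) concrete, here is a minimal sketch (not the authors' implementation) that solves the DSRR dual for the squared hinge loss by projected gradient ascent. It assumes the change of variables $\beta = -\alpha \circ y \ge 0$, under which the objective becomes $\frac{1}{n}\sum_i [(1-\tau)\beta_i - \beta_i^2/4] - \frac{1}{2\lambda n^2}(\beta \circ y)^\top \widehat{X}^\top \widehat{X}(\beta \circ y)$. The function names, the fixed step size, the synthetic data and the Gaussian projection are all illustrative assumptions.

```python
import numpy as np

def dsrr_squared_hinge(X_hat, y, lam, tau, n_iter=2000):
    """Projected gradient ascent on the DSRR dual for the squared hinge loss.

    X_hat: (m, n) reduced data with examples as columns; y: (n,) labels in {+1, -1}.
    Optimizes over beta = -alpha * y >= 0 and returns alpha.
    """
    n = X_hat.shape[1]
    Z = X_hat * y                      # column i is y_i * x_hat_i
    K = Z.T @ Z                        # beta^T K beta = (beta∘y)^T X_hat^T X_hat (beta∘y)
    # Lipschitz constant of the gradient of the (concave) objective.
    lip = 0.5 / n + np.linalg.eigvalsh(K)[-1] / (lam * n**2)
    beta = np.zeros(n)
    for _ in range(n_iter):
        grad = ((1.0 - tau) - 0.5 * beta) / n - (K @ beta) / (lam * n**2)
        beta = np.maximum(0.0, beta + grad / lip)   # ascent step, then project onto beta >= 0
    return -beta * y

def recover_primal(X, alpha, lam):
    """Dual recovery in the original space: w = -(1 / (lam * n)) X alpha."""
    return -(X @ alpha) / (lam * X.shape[1])

def primal_gradient(X, y, w, lam):
    """Gradient of the primal objective (1) with the squared hinge loss."""
    slack = np.maximum(0.0, 1.0 - y * (X.T @ w))
    return -(2.0 / X.shape[1]) * (X @ (y * slack)) + lam * w

# Synthetic data (an assumption for illustration): unit-norm columns, linear labels.
rng = np.random.default_rng(0)
d, n, m, lam = 200, 60, 50, 0.1
X = rng.standard_normal((d, n))
X /= np.linalg.norm(X, axis=0)
y = np.sign(rng.standard_normal(d) @ X)
y[y == 0] = 1.0

# Reference solution: no reduction and tau = 0 solves the original dual (2).
alpha_star = dsrr_squared_hinge(X, y, lam, tau=0.0)
w_star = recover_primal(X, alpha_star, lam)

# DSRR: Gaussian random projection, dual-sparse regularization, dual recovery.
A = rng.standard_normal((m, d)) / np.sqrt(m)
alpha_tilde = dsrr_squared_hinge(A @ X, y, lam, tau=0.1)
w_tilde = recover_primal(X, alpha_tilde, lam)
rel_err = np.linalg.norm(w_tilde - w_star) / np.linalg.norm(w_star)
print(f"relative primal recovery error: {rel_err:.3f}")
```

Setting `tau=0.0` with the unreduced data recovers the original problem, which gives a reference solution for measuring the recovery error; with $\tau \ge 1$ the sketch returns the zero vector, matching the interpretation of $1 - \tau$ as a shrunken margin.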


To present the theoretical result on the non-smooth loss functions, we
need to introduce restricted eigenvalue conditions similar to those used
in the sparse recovery analysis for LASSO (Bickel et al., 2009; Xiao &
Zhang, 2013). In particular, we introduce the following definition of
restricted eigenvalue condition.


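Definition 2. Given an integer $s > 0$, we define

$$\mathcal{K}_{n,s} = \{\alpha \in \mathbb{R}^n : \|\alpha\|_2 \le 1, \|\alpha\|_1 \le \sqrt{s}\}.$$

We say that $X$ satisfies the restricted eigenvalue condition at
sparsity level $s$ if there exist positive constants $\rho_s^+$ and
$\rho_s^-$ such that

$$\rho_s^+ = \sup_{\alpha \in \mathcal{K}_{n,s}} \frac{\alpha^\top X^\top X \alpha}{n}, \qquad \rho_s^- = \inf_{\alpha \in \mathcal{K}_{n,s}} \frac{\alpha^\top X^\top X \alpha}{n}.$$

We also define another quantity that measures the restricted eigenvalue
of $X^\top X - \widehat{X}^\top \widehat{X}$, namely

$$\sigma_s = \sup_{\alpha \in \mathcal{K}_{n,s}} \frac{|\alpha^\top (X^\top X - \widehat{X}^\top \widehat{X}) \alpha|}{n}. \tag{11}$$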


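Theorem 3. Let $\widetilde{\alpha}_*$ be the optimal dual solution to
(5). Assume $\alpha_*$ is $s$-sparse with the support set given by $S$.
If $\tau \ge \frac{2}{\lambda n}\|(X^\top X - \widehat{X}^\top \widehat{X})\alpha_*\|_\infty$,
then we have

$$\|[\widetilde{\alpha}_*]_{S^c}\|_1 \le 3\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1.$$

Assume the data matrix $X$ satisfies the restricted eigenvalue condition
at sparsity level $16s$ and $\sigma_{16s} < \rho_{16s}^-$; then we have

$$\|\widetilde{\alpha}_* - \alpha_*\|_2 \le \frac{3\lambda}{2(\rho_{16s}^- - \sigma_{16s})}\tau\sqrt{s}, \qquad \|\widetilde{\alpha}_* - \alpha_*\|_1 \le \frac{6\lambda}{\rho_{16s}^- - \sigma_{16s}}\tau s.$$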


Remark 3: The proof is included in Appendix B. Compared to smooth loss
functions, the conditions that guarantee a small recovery error for
non-smooth loss functions are more restrictive. In the next section, we
will provide a bound on $\sigma_{16s}$ to further understand the
condition $\sigma_{16s} < \rho_{16s}^-$, which essentially implies that
$m \ge \Omega(s\log(n/s))$.


Last but not least, we provide a theoretical result on the recovery
error for the nearly sparse optimal dual variable $\alpha_*$. We state
the result for smooth loss functions. To quantify the near sparsity, we
let $\alpha_*^s \in \mathbb{R}^n$ denote a vector that zeros all entries
in $\alpha_*$ except for the top-$s$ elements in magnitude and assume
$\alpha_*^s$ satisfies the following condition:

$$\left\|\nabla\ell^*(\alpha_*^s) + \frac{1}{\lambda n}X^\top X \alpha_*^s\right\|_\infty \le \xi \tag{12}$$

where $\nabla\ell^*(\alpha) = (\nabla\ell_1^*(\alpha_1), \ldots, \nabla\ell_n^*(\alpha_n))^\top$.
The above condition can be considered as a sub-optimality condition
(Boyd & Vandenberghe, 2004) of $\alpha_*^s$ measured in the infinity
norm. For the optimal solution $\alpha_*$, we have
$\nabla\ell^*(\alpha_*) + \frac{1}{\lambda n}X^\top X \alpha_* = 0$.

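Theorem 4. Let $\widetilde{\alpha}_*$ be the optimal dual solution to
(5). Assume $\alpha_*$ is nearly $s$-sparse such that (12) holds with
the support set of $\alpha_*^s$ given by $S$. If
$\tau \ge \frac{2}{\lambda n}\|(X^\top X - \widehat{X}^\top \widehat{X})\alpha_*\|_\infty + 2\xi$,
then we have

$$\|[\widetilde{\alpha}_*]_{S^c}\|_1 \le 3\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1.$$

Furthermore, if $\ell(z)$ is an $L$-smooth loss function, we have

$$\|\widetilde{\alpha}_* - \alpha_*\|_2 \le 3\tau L\sqrt{s}, \qquad \|\widetilde{\alpha}_* - \alpha_*\|_1 \le 12\tau L s \tag{13}$$

$$\|[\widetilde{\alpha}_*]_S - [\alpha_*]_S\|_1 \le 3\tau L s, \qquad \|[\widetilde{\alpha}_*]_{S^c}\|_1 \le 9\tau L s \tag{14}$$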

Remark 4: The proof appears in Appendix C. Compared to Theorem 1 for
exactly sparse optimal dual solution, the dual recovery error bound for
nearly sparse optimal dual solution is increased by $6L\sqrt{s}\,\xi$
for the $\ell_2$ norm and by $24Ls\,\xi$ for the $\ell_1$ norm.

Finally, we note that with the recovery error bound for the dual
solution, we can easily derive an error bound for the primal solution
$\widetilde{w}_* = -\frac{1}{\lambda n}X\widetilde{\alpha}_*$. Below we
present a theorem for smooth loss functions. One can easily extend the
result to non-smooth loss functions.

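Theorem 5. Let $\widetilde{w}_*$ be the recovered primal solution using
$\widetilde{\alpha}_*$, the optimal dual solution to (5). Assume
$\alpha_*$ is $s$-sparse and $\ell(z)$ is an $L$-smooth loss function.
If $\tau \ge \frac{2}{\lambda n}\|(X^\top X - \widehat{X}^\top \widehat{X})\alpha_*\|_\infty$,
then we have

$$\|\widetilde{w}_* - w_*\|_2 \le \frac{\sigma_1}{\lambda n}\, 3L\tau\sqrt{s}$$

where $\sigma_1$ is the maximum singular value of $X$. Furthermore, if
$\frac{1}{n}X^\top X$ has a restricted eigenvalue $\rho_{16s}^+$ at
sparsity level $16s$, then

$$\|\widetilde{w}_* - w_*\|_2 \le \frac{\sqrt{\rho_{16s}^+}}{\lambda\sqrt{n}}\, 3L\tau\sqrt{s}.$$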

Remark 5: Since $\rho_{16s}^+$ is always less than $\sigma_1^2/n$, the
second result, which holds under the restricted eigenvalue condition, is
always better than the first result. With the bound of $\tau$ as
revealed later, we can see that the error of $\widetilde{w}_*$ scales as
$O\left(\sqrt{s/m}\,\|w_*\|_2\right)$ in terms of the sparsity $s$ of
$\alpha_*$, the reduced dimensionality $m$ and the magnitude of $w_*$. A
similar order of error bound was established in (Zhang et al., 2014)
assuming $w_*$ is $s$-sparse and $X$ is approximately low rank. In
contrast, we do not assume $X$ is approximately low rank.

4. Analysis 

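In this section, we first provide an upper bound analysis of
$\frac{1}{\lambda n}\|(X^\top X - \widehat{X}^\top \widehat{X})\alpha_*\|_\infty$
and $\sigma_s$. To facilitate our analysis, we define

$$\Delta = \frac{1}{\lambda n}(X^\top X - \widehat{X}^\top \widehat{X})\alpha_*.$$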
4.1. Bounding $\|\Delta\|_\infty$

A critical condition in both Theorem 1 and Theorem 3 is
$\tau \ge 2\|\Delta\|_\infty$. In order to reveal the theoretical value
of $\tau$ and its implication for various randomized reduction methods,
we need to bound $\|\Delta\|_\infty$. We first provide a general
analysis and then study its implication for various randomized reduction
methods separately. The analysis is based on the following assumption,
which essentially is indicated by Johnson-Lindenstrauss (JL)-type
lemmas.

Assumption 1 (A1). Let $A(x) = Ax$ be a linear projection operator where
$A \in \mathbb{R}^{m \times d}$ such that for any given
$x \in \mathbb{R}^d$, with a high probability $1 - \delta$, we have

$$\left|\|Ax\|_2^2 - \|x\|_2^2\right| \le \epsilon_{A,\delta}\|x\|_2^2$$

where $\epsilon_{A,\delta}$ depends on $m$, $\delta$ and possibly $d$.

With this assumption, we have the following theorem regarding the upper
bound of $\|\Delta\|_\infty$.

Theorem 6. Suppose $A \in \mathbb{R}^{m \times d}$ satisfies Assumption
A1; then with a high probability $1 - 2\delta$ we have

$$\|\Delta\|_\infty \le R\|w_*\|_2\,\epsilon_{A,\delta/n}$$

where $R = \max_i \|x_i\|_2$.

Proof.

$$\frac{1}{\lambda n}(\widehat{X}^\top \widehat{X} - X^\top X)\alpha_* = \frac{1}{\lambda n}(X^\top A^\top A X - X^\top X)\alpha_* = \frac{1}{\lambda n}X^\top (A^\top A - I)X\alpha_* = X^\top (I - A^\top A) w_*$$

where we use the fact $w_* = -\frac{1}{\lambda n}X\alpha_*$. Then

$$\frac{1}{\lambda n}\left[(\widehat{X}^\top \widehat{X} - X^\top X)\alpha_*\right]_i = x_i^\top (I - A^\top A) w_*.$$

Therefore in order to bound $\|\Delta\|_\infty$, we need to bound
$x_i^\top (I - A^\top A) w_*$ for all $i \in [n]$. We first bound for
individual $i$ and then apply the union bound. Let $\bar{x}_i$ and
$\bar{w}_*$ be normalized versions of $x_i$ and $w_*$, i.e.,
$\bar{x}_i = x_i/\|x_i\|_2$ and $\bar{w}_* = w_*/\|w_*\|_2$. Suppose
Assumption A1 is satisfied; then with a probability $1 - \delta$,

$$\bar{x}_i^\top A^\top A \bar{w}_* - \bar{x}_i^\top \bar{w}_* = \frac{\|A(\bar{x}_i + \bar{w}_*)\|_2^2 - \|A(\bar{x}_i - \bar{w}_*)\|_2^2}{4} - \bar{x}_i^\top \bar{w}_* \le \frac{\epsilon_{A,\delta}}{2}\left(\|\bar{x}_i\|_2^2 + \|\bar{w}_*\|_2^2\right) \le \epsilon_{A,\delta}.$$

Similarly with a probability $1 - \delta$,

$$\bar{x}_i^\top A^\top A \bar{w}_* - \bar{x}_i^\top \bar{w}_* = \frac{\|A(\bar{x}_i + \bar{w}_*)\|_2^2 - \|A(\bar{x}_i - \bar{w}_*)\|_2^2}{4} - \bar{x}_i^\top \bar{w}_* \ge -\frac{\epsilon_{A,\delta}}{2}\left(\|\bar{x}_i\|_2^2 + \|\bar{w}_*\|_2^2\right) \ge -\epsilon_{A,\delta}.$$

Therefore with a probability $1 - 2\delta$, we have

$$|x_i^\top A^\top A w_* - x_i^\top w_*| = \|x_i\|_2\|w_*\|_2\,|\bar{x}_i^\top A^\top A \bar{w}_* - \bar{x}_i^\top \bar{w}_*| \le \|x_i\|_2\|w_*\|_2\,\epsilon_{A,\delta}.$$

Then applying the union bound, we complete the proof.

Next, we discuss four classes of randomized reduction operators, namely
random projection, randomized Hadamard transform, random hashing and
random sampling, and study the corresponding $\epsilon_{A,\delta}$ and
their implications for the recovery error.


Random Projection. Random projection has been employed widely for
dimension reduction. The projection operator $A$ is usually sampled from
sub-Gaussian distributions with mean 0 and variance $1/m$, e.g.,
(i) Gaussian distribution: $A_{ij} \sim \mathcal{N}(0, 1/m)$,
(ii) Rademacher distribution: $\Pr(A_{ij} = \pm 1/\sqrt{m}) = 0.5$,
(iii) discrete distribution: $\Pr(A_{ij} = \pm\sqrt{3/m}) = 1/6$ and
$\Pr(A_{ij} = 0) = 2/3$. The last two distributions for dimensionality
reduction were proposed and analyzed in (Achlioptas, 2003). The
following lemma is the general JL-type lemma for $A$ with sub-Gaussian
entries, which reveals the value of $\epsilon_{A,\delta}$ in Assumption
A1.


Lemma 1. (Nelson) Let $A \in \mathbb{R}^{m \times d}$ be a random matrix
with sub-Gaussian entries of mean 0 and variance $1/m$. For any given
$x$, with a probability $1 - \delta$, we have

$$\left|\|Ax\|_2^2 - \|x\|_2^2\right| \le c\sqrt{\frac{\log(1/\delta)}{m}}\,\|x\|_2^2,$$

where $c$ is some small universal constant.

Randomized Hadamard Transform. Randomized Hadamard transform was
introduced to speed up random projection, reducing the computational
time³ of random projection from $O(dm)$ to $O(d\log d)$ or even
$O(d\log m)$. The projection matrix $A$ is of the form $A = PHD$, where

• $D \in \mathbb{R}^{d \times d}$ is a diagonal matrix with
$D_{ii} = \pm 1$ with equal probabilities.

• $H$ is the $d \times d$ Hadamard matrix (assuming $d$ is a power of
2), scaled by $1/\sqrt{d}$.

• $P \in \mathbb{R}^{m \times d}$ is typically a sparse matrix that
facilitates computing $Px$. Several choices of $P$ are possible (Nelson;
Ailon & Chazelle, 2009; Tropp, 2011). Below we provide a JL-type lemma
for a randomized Hadamard transform with
$P \in \mathbb{R}^{m \times d}$ that samples $m$ coordinates from
$\sqrt{\frac{d}{m}}HDx$ with replacement.


Lemma 2. (Nelson) Let $A = \sqrt{\frac{d}{m}}PHD \in \mathbb{R}^{m \times d}$
be a randomized Hadamard transform with $P$ being a random sampling
matrix. For any given $x$, with a probability $1 - \delta$, we have

$$\left|\|Ax\|_2^2 - \|x\|_2^2\right| \le c\sqrt{\frac{\log(1/\delta)\log(d/\delta)}{m}}\,\|x\|_2^2,$$

where $c$ is some small universal constant.

Remark 6: Compared to random projection, there is an additional
$\sqrt{\log(d/\delta)}$ factor in $\epsilon_{A,\delta}$. However, it can
be removed by applying an additional random projection. In particular,
if we let $A = \sqrt{\frac{d}{t}}P'PHD \in \mathbb{R}^{m \times d}$,
where $P \in \mathbb{R}^{t \times d}$ is a random sampling matrix with
$t = m\log(d/\delta)$ and $P' \in \mathbb{R}^{m \times t}$ is a random
projection matrix that satisfies Lemma 1, then we have the same order of
$\epsilon_{A,\delta}$. Please refer to (Nelson) for more details.

³ Refers to the running time of computing $Ax$.
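The construction $A = \sqrt{d/m}\,PHD$ from Lemma 2 can be sketched without ever forming $H$, using a fast Walsh-Hadamard transform. The snippet below is a minimal illustration under stated assumptions: the helper names `fwht` and `randomized_hadamard` are hypothetical, $d$ must be a power of 2, and $P$ samples coordinates uniformly with replacement.

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform (natural order), O(d log d)."""
    x = np.array(x, dtype=float)   # work on a copy
    d = x.shape[0]
    if d & (d - 1):
        raise ValueError("length must be a power of 2")
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def randomized_hadamard(x, m, rng):
    """Compute sqrt(d/m) * P H D x with random signs D and uniform sampling P."""
    d = x.shape[0]
    signs = rng.choice([-1.0, 1.0], size=d)   # diagonal of D
    hdx = fwht(signs * x) / np.sqrt(d)        # H D x with H scaled by 1/sqrt(d)
    idx = rng.integers(0, d, size=m)          # P: m coordinates, with replacement
    return np.sqrt(d / m) * hdx[idx]

rng = np.random.default_rng(0)
d, m = 1024, 128
x = np.zeros(d)
x[3] = 1.0                                    # a maximally sparse unit vector
x_hat = randomized_hadamard(x, m, rng)
print(np.linalg.norm(x_hat) ** 2)
```

The sparse input is chosen on purpose: sampling coordinates of $x$ directly would almost always miss its single nonzero entry, whereas $HDx$ spreads the mass evenly so that every sampled coordinate carries the same magnitude.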


Random Hashing. Another line of work to speed up random projection is
random hashing, which makes the projection matrix $A$ much sparser and
takes advantage of the sparsity of feature vectors. It was introduced in
(Shi et al., 2009b) for dimensionality reduction and later was improved
to an unbiased version by (Weinberger et al., 2009) with some
theoretical analysis. Dasgupta et al. (2010) provided a rigorous
analysis of the unbiased random hashing. Recently, Kane & Nelson (2014)
proposed two new random hashing algorithms with a slightly sparser
random matrix $A$. Here we provide a JL-type lemma for the random
hashing algorithm in (Weinberger et al., 2009; Dasgupta et al., 2010).
Let $h: \mathbb{N} \to [m]$ denote a random hashing function, and
$\xi = (\xi_1, \ldots, \xi_d)$ denote a Rademacher random variable,
i.e., $\xi_i$, $i = 1, \ldots, d$ are independent and
$\xi_i \in \{1, -1\}$ with equal probabilities. The projection matrix
$A$ can be written as $A = HD$, where $D \in \mathbb{R}^{d \times d}$ is
a diagonal matrix with $D_{jj} = \xi_j$, and
$H \in \mathbb{R}^{m \times d}$ with $H_{ij} = \delta_{i,h(j)}$⁴. Under
the random matrix $A$, the feature vector $x \in \mathbb{R}^d$ is
reduced to $\widehat{x} \in \mathbb{R}^m$, where
$[\widehat{x}]_i = \sum_{j: h(j) = i} [x]_j \xi_j$. The following
JL-type lemma is a basic result from (Dasgupta et al., 2010) with a
rephrasing.


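Lemma 3. Let $A = HD \in \mathbb{R}^{m \times d}$ be a random hashing
matrix. For any given vector $x \in \mathbb{R}^d$ such that
$\frac{\|x\|_\infty}{\|x\|_2} \le \frac{1}{\sqrt{c}}$, for
$\delta < 0.1$, with a probability $1 - 3\delta$, we have

$$\left|\|Ax\|_2^2 - \|x\|_2^2\right| \le \sqrt{\frac{12\log(1/\delta)}{m}}\,\|x\|_2^2,$$

where $c = 8\sqrt{m/3}\,\log^{1/2}(1/\delta)\log^2(m/\delta)$.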

Remark 7: Compared to random projection, there is an additional
condition on the feature vector,
$\frac{\|x\|_\infty}{\|x\|_2} \le \frac{1}{\sqrt{c}}$. However, it can
be removed by applying an extra preconditioner $P$ to $x$ before
applying the projection matrix $A$, i.e., $\widehat{x} = HDPx$. Two
preconditioners were discussed in (Dasgupta et al., 2010), with one
corresponding to duplicating $x$ $c$ times and scaling it by
$1/\sqrt{c}$ and another one given by $P \in \mathbb{R}^{d \times d}$
which consists of $d/b$ diagonal blocks of $b \times b$ randomized
Hadamard matrix, where $b = 6c\log(3c/\delta)$. The running time of the
reduction using the latter preconditioner is
$O(d\log c\log\log c)$.
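The hashing map $[\widehat{x}]_i = \sum_{j: h(j) = i} \xi_j [x]_j$ can be applied in $O(d)$ time without materializing $A = HD$. The following sketch is illustrative only: the random lookup table standing in for the hash function $h$, the helper name `hash_features` and the dimensions are assumptions, and it uses 0-based bucket indices.

```python
import numpy as np

def hash_features(x, h, xi, m):
    """Signed feature hashing: bucket i accumulates xi_j * x_j over all j with h[j] == i."""
    return np.bincount(h, weights=xi * x, minlength=m)

rng = np.random.default_rng(0)
d, m = 2000, 256
h = rng.integers(0, m, size=d)           # stand-in for a random hash function h: [d] -> [m]
xi = rng.choice([-1.0, 1.0], size=d)     # Rademacher signs (diagonal of D)
x = rng.standard_normal(d)

x_hat = hash_features(x, h, xi, m)
ratio = np.linalg.norm(x_hat) ** 2 / np.linalg.norm(x) ** 2
print(f"||x_hat||^2 / ||x||^2 = {ratio:.3f}")
```

In practice a deterministic hash of the feature index or name would replace the random table, so that `h` and `xi` never need to be stored.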

Random Sampling. Last we discuss random sampling and compare with the
aforementioned randomized reduction methods. In fact, the JL-type lemma
for random sampling is implicit in the proof of Lemma 2. We make it
explicit in the following lemma.

Lemma 4. Let $A = \sqrt{\frac{d}{m}}P \in \mathbb{R}^{m \times d}$ be a
scaled random sampling matrix where $P \in \mathbb{R}^{m \times d}$
samples $m$ coordinates with replacement. Then with a probability
$1 - \delta$, we have

$$\left|\|Ax\|_2^2 - \|x\|_2^2\right| \le \frac{\|x\|_\infty}{\|x\|_2}\sqrt{\frac{3d\log(1/\delta)}{m}}\,\|x\|_2^2.$$

⁴ $\delta_{ij} = 1$ if $i = j$, and 0 otherwise.


Remark 8: Compared with the other three randomized reduction methods,
there is an additional $\sqrt{d}\,\frac{\|x\|_\infty}{\|x\|_2}$ factor
in $\epsilon_{A,\delta}$, which could result in a much larger
$\epsilon_{A,\delta}$ and consequently a larger recovery error. That is
why the randomized Hadamard transform was introduced to make this
additional factor close to a constant.


From the above discussions, we can conclude that with random projection,
randomized Hadamard transform and random hashing, with a probability
$1 - \delta$ we have

$$\|\Delta\|_\infty = \max_i |x_i^\top (I - A^\top A) w_*| \le cR\sqrt{\frac{\log(n/\delta)}{m}}\,\|w_*\|_2,$$

which essentially indicates that
$\tau \ge 2cR\sqrt{\frac{\log(n/\delta)}{m}}\,\|w_*\|_2$.
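The quantity $\|\Delta\|_\infty$ and its dependence on $m$ can be inspected numerically. The sketch below is an illustration under stated assumptions: a random vector stands in for the optimal dual solution $\alpha_*$ (the identity used in the proof of Theorem 6 holds for any $\alpha$ once $w = -\frac{1}{\lambda n}X\alpha$), the data are synthetic, and $A$ is a Gaussian random projection.

```python
import numpy as np

def delta_vector(X, alpha, A, lam):
    """Delta = (1 / (lam * n)) (X^T X - X_hat^T X_hat) alpha with X_hat = A X."""
    n = X.shape[1]
    X_hat = A @ X
    return (X.T @ (X @ alpha) - X_hat.T @ (X_hat @ alpha)) / (lam * n)

rng = np.random.default_rng(0)
d, n, lam = 2000, 100, 0.1
X = rng.standard_normal((d, n))
X /= np.linalg.norm(X, axis=0)       # unit-norm examples, so R = 1
alpha = rng.standard_normal(n)       # stand-in for alpha_* (assumption)
w = -(X @ alpha) / (lam * n)         # primal vector paired with alpha

sup_norms = {}
for m in (64, 256, 1024):
    A = rng.standard_normal((m, d)) / np.sqrt(m)
    delta = delta_vector(X, alpha, A, lam)
    sup_norms[m] = np.max(np.abs(delta))
    # 2 * ||Delta||_inf is the smallest tau allowed by Theorems 1 and 3.
    print(f"m = {m:4d}: ||Delta||_inf = {sup_norms[m]:.4f}, tau >= {2 * sup_norms[m]:.4f}")
```

Since the bound scales as $\sqrt{1/m}$, quadrupling $m$ should roughly halve the printed values, up to random fluctuation.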


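4.2. Bounding $\sigma_s$ for non-smooth case

Another condition in Theorem 3 is to require
$\sigma_{16s} < \rho_{16s}^-$. Since $\rho_{16s}^-$ is dependent on the
data, we provide an upper bound of $\sigma_{16s}$ to further understand
the condition. In the following analysis, we assume
$\epsilon_{A,\delta} = O\left(\sqrt{\frac{\log(1/\delta)}{m}}\right)$.
Recall the definition of $\sigma_s$:

$$\sigma_s = \sup_{\alpha \in \mathcal{K}_{n,s}} \frac{|\alpha^\top (X^\top X - \widehat{X}^\top \widehat{X}) \alpha|}{n}. \tag{15}$$

We provide a bound of $\sigma_s$ below.

The key idea is to use the convex relaxation of $\mathcal{K}_{n,s}$.
Define $\mathcal{S}_{n,s} = \{\alpha \in \mathbb{R}^n : \|\alpha\|_2 \le 1, \|\alpha\|_0 \le s\}$.
It was shown in (Plan & Vershynin, 2011) that
$\mathrm{conv}(\mathcal{S}_{n,s}) \subset \mathcal{K}_{n,s} \subset 2\,\mathrm{conv}(\mathcal{S}_{n,s})$,
where $\mathrm{conv}(\mathcal{S})$ is the convex hull of the set
$\mathcal{S}$. It is not difficult to show that (see the supplement)

$$\max_{\alpha \in \mathcal{K}_{n,s}} |(X\alpha)^\top (I - A^\top A)(X\alpha)| \le 4\max_{\alpha_1, \alpha_2 \in \mathcal{S}_{n,s}} |(X\alpha_1)^\top (I - A^\top A)(X\alpha_2)|.$$

Let $u_1 = X\alpha_1$ and $u_2 = X\alpha_2$. For any fixed
$\alpha_1, \alpha_2 \in \mathcal{S}_{n,s}$, with a probability
$1 - \delta$ we can have

$$\frac{1}{n}|(X\alpha_1)^\top (I - A^\top A)(X\alpha_2)| = O\left(\rho_s^+\sqrt{\frac{\log(1/\delta)}{m}}\right),$$

where we use

$$\max_{\alpha \in \mathcal{S}_{n,s}} \frac{\|X\alpha\|_2^2}{n} \le \max_{\alpha \in \mathcal{K}_{n,s}} \frac{\|X\alpha\|_2^2}{n} = \rho_s^+.$$

Then by using Lemma 3.3 in (Plan & Vershynin, 2011) about the entropy of
$\mathcal{S}_{n,s}$ and the union bound, we can arrive at the following
upper bound for $\sigma_s$.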
Theorem 7. With a probability $1 - \delta$, we have

$$\sigma_s \le O\left(\rho_s^+\sqrt{\frac{\log(1/\delta) + s\log(n/s)}{m}}\right).$$


Remark 9: With the above result, we can further understand the condition
$\sigma_{16s} < \rho_{16s}^-$, which amounts to

$$O\left(\rho_{16s}^+\sqrt{\frac{\log(1/\delta) + s\log(n/s)}{m}}\right) \le \rho_{16s}^-,$$

i.e., $m \ge \Omega\left(\kappa_{16s}^2(\log(1/\delta) + s\log(n/s))\right)$,
where $\kappa_{16s} = \rho_{16s}^+/\rho_{16s}^-$ is the restricted
condition number of the data matrix.


5. Numerical Experiments 

In this section, we provide a case study in support of DSRR and the
theoretical analysis, and a demonstration of the application of DSRR to
distributed optimization.

A case study on text classification. We use the RCV1-binary data (Lewis
et al., 2004) to conduct a case study. The data contains 697,641
documents and 47,236 features. We use a splitting of 677,399/20,242 for
training and testing. The feature vectors were normalized such that the
$\ell_2$ norm is equal to 1. We only report the results using random
hashing since it is the most efficient, while other randomized reduction
methods (except for random sampling) have similar performance. For the
loss function, we use both the squared hinge loss (smooth) and the hinge
loss (non-smooth). We aim to examine two questions related to our
analysis and motivation: (i) how does the value of $\tau$ affect the
recovery error? (ii) how does the number of samples $m$ affect the
recovery error?

We vary the value of $\tau$ among 0, 0.1, 0.2, ..., 0.9, the value of
$m$ among 1024, 2048, 4096, 8192, and the value of $\lambda$ among
0.001, 0.00001. Note that $\tau = 0$ corresponds to the randomized
reduction approach without the sparse regularizer. The results averaged
over 5 random trials are shown in Figure 1 for the squared hinge loss
and in Figure 2 for the hinge loss. We first analyze the results in
Figure 1. We can observe that when $\tau$ increases the ratio
$\|[\widetilde{\alpha}_*]_{S^c}\|_1 / \|[\widetilde{\alpha}_*]_S\|_1$
decreases, indicating that the magnitude of dual variables for the
original non-support vectors decreases. This is intuitive and consistent
with our motivation. The recovery error of the dual solution (middle)
first decreases and then increases. This can be partially explained by
the theoretical result in Theorem 1. When the value of $\tau$ becomes
larger than a certain threshold, so that the condition on $\tau$ in
Theorem 1 (a lower bound in terms of $\|\Delta\|_\infty$) holds, Theorem
1 implies that a larger $\tau$ will lead to a larger error. On the other
hand, when $\tau$ is less than the threshold, the dual recovery error
will decrease as $\tau$ increases. In addition, the figures exhibit that
the thresholds for larger $m$ are smaller, which is consistent with our
analysis of $\|\Delta\|_\infty = O(\sqrt{1/m})$. The difference between
$\lambda = 0.001$ and $\lambda = 0.00001$ arises because a smaller
$\lambda$ leads to a larger $\|w_*\|_2$. In terms of the hinge loss, we
observe similar trends; however, the recovery is much more difficult
than that for squared hinge loss, especially when the value of $\lambda$
is small.

Figure 1. Recovery error for squared hinge loss. From left to right: the
ratio $\|[\widetilde{\alpha}_*]_{S^c}\|_1 / \|[\widetilde{\alpha}_*]_S\|_1$
vs $\tau$, the dual recovery error (in $\ell_2$ norm) vs $\tau$, and the
primal recovery error (in terms of $\widetilde{w}_* - w_*$) vs $\tau$.

Figure 2. Same curves as above but for non-smooth hinge loss.

An application to distributed learning. Although in some cases the
solution learned in the reduced space can provide sufficiently good
performance, it usually performs worse than the optimal solution that
solves the original problem, and sometimes the performance gap between
them cannot be ignored, as seen in the following experiments. To address
this issue, we combine the benefits of distributed learning and the
proposed randomized reduction methods for solving big data problems.
When data is too large and sits on multiple machines, distributed
learning can be employed to solve the optimization problem. In
distributed learning, individual machines iteratively solve sub-problems
associated with the subset of data on them and communicate some global
variables (e.g., the primal solution $w \in \mathbb{R}^d$) among them.
When the dimensionality $d$ is very large, the total communication cost
could be very high. To reduce the total communication cost, we propose
to first solve the reduced data problem and then use the found solution
as the initial solution to the distributed learning for the original
data.
Table 1. Statistics of datasets

Name   #Training   #Testing   #Features    #Nodes
RCV1     677,399     20,242       47,236        5
KDD    8,407,752    748,401   29,890,095       10
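To make the communication argument concrete, the following back-of-the-envelope sketch compares the number of floats exchanged per communication round in the original and in the reduced space. It assumes, purely for illustration, that every node sends one dense vector of the working dimension per round; the helper name `floats_per_round` is hypothetical, the KDD dimensionality and node count are taken from Table 1, and m = 1024 is the reduced dimension used in the experiments.

```python
def floats_per_round(dim, num_nodes):
    """Floats exchanged in one round if every node sends one dense dim-vector.

    Illustrative cost model only (an assumption, not the paper's accounting).
    """
    return dim * num_nodes


d_kdd, m, nodes = 29_890_095, 1024, 10    # KDD features, reduced dimension, nodes

full = floats_per_round(d_kdd, nodes)     # original space
reduced = floats_per_round(m, nodes)      # reduced space
ratio = full / reduced                    # equals d_kdd / m

print(f"original: {full:,} floats per round")
print(f"reduced:  {reduced:,} floats per round")
print(f"reduction factor: {ratio:,.1f}")
```

Under this simple model, each round on the reduced problem is cheaper by a factor of d/m, which is why only a small number of rounds on the original data is affordable.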


Below, we demonstrate the effectiveness of DSRR for the
recently proposed distributed stochastic dual coordinate ascent
(DisDCA) algorithm (Yang, 2013). The procedure is:
(1) reduce the original high-dimensional data to a very low
dimensional space on individual machines; (2) use DisDCA
to solve the reduced problem; (3) use the optimal dual solution
to the reduced problem as an initial solution to DisDCA
for solving the original problem. We record the running
time for randomized reduction in step 1, the optimization
of the reduced problem in step 2, and the optimization of
the original problem in step 3. We compare the performance
of four methods: (i) the DSRR method that uses the
model of the reduced problem solved by DisDCA to make
predictions; (ii) the method that uses the recovered model
in the original space, referred to as DSRR-Rec; (iii) the
method that uses the dual solution to the reduced problem
as an initial solution of DisDCA and runs it for the original
problem with $k = 1$ or $2$ communications (the number
of updates before each communication is set to the number
of examples in each machine), referred to as
DSRR-DisDCA-$k$; and (iv) the distributed method that directly
solves the original problem by DisDCA. For DisDCA on
the original problem, we stop running when its performance
on the testing data does not improve. Two data
sets are used, namely RCV1-binary and KDD 2010 Cup data.
For KDD 2010 Cup data, we use the one available on the
LibSVM data website. The statistics of the two data sets are
summarized in Table 1. The results averaged over 5 trials
are shown in Figure 3, which shows that the performance
of DSRR-DisDCA-1/2 is remarkable in the sense that it
achieves almost the same performance as directly training
on the original data (DisDCA) while using much less training
time. In addition, DSRR-DisDCA performs much better
than DSRR and has small computational overhead.
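The three-step procedure can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the DisDCA implementation used in the experiments: plain sequential SDCA stands in for DisDCA, the loss is the squared hinge loss, and the dual variables are parametrized as $\alpha_i = -y_i\beta_i$ with $\beta_i \ge 0$, under which the dual-sparse regularizer only changes the constant $1$ in the closed-form coordinate update to $1 - \tau$. All function names (`random_hashing`, `sdca`, `neg_dual`, `primal`) are hypothetical.

```python
import numpy as np


def random_hashing(X, m, rng):
    """Step 1: hash the d columns of X (n x d) into m signed buckets."""
    d = X.shape[1]
    H = np.zeros((d, m))
    H[np.arange(d), rng.integers(0, m, size=d)] = rng.choice([-1.0, 1.0], size=d)
    return X @ H


def neg_dual(X, y, lam, beta, tau=0.0):
    """Dual objective to be minimized, written in beta (alpha_i = -y_i * beta_i)."""
    n = X.shape[0]
    v = X.T @ (y * beta)
    return np.mean((tau - 1.0) * beta + beta**2 / 4.0) + v @ v / (2.0 * lam * n**2)


def primal(X, y, lam, w):
    """Primal objective: mean squared hinge loss plus (lam / 2) * ||w||^2."""
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)) ** 2) + 0.5 * lam * (w @ w)


def sdca(X, y, lam, tau=0.0, beta0=None, epochs=1, seed=0):
    """Sequential SDCA with exact coordinate minimization; beta0 is a warm start."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    beta = np.zeros(n) if beta0 is None else beta0.copy()
    w = X.T @ (y * beta) / (lam * n)          # primal vector induced by beta
    sq = np.einsum("ij,ij->i", X, X)          # squared row norms
    for _ in range(epochs):
        for i in rng.permutation(n):
            step = (1.0 - tau - y[i] * (X[i] @ w) - beta[i] / 2.0) / (
                0.5 + sq[i] / (lam * n))
            new = max(0.0, beta[i] + step)    # keep beta_i >= 0
            w += (new - beta[i]) * y[i] * X[i] / (lam * n)
            beta[i] = new
    return beta, w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m, lam = 200, 500, 32, 1e-2
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    y = np.where(X @ rng.standard_normal(d) >= 0, 1.0, -1.0)

    X_hat = random_hashing(X, m, rng)                          # step 1
    beta_red, _ = sdca(X_hat, y, lam, tau=0.5, epochs=5)       # step 2
    beta_warm, w_warm = sdca(X, y, lam, beta0=beta_red, epochs=1)  # step 3
    beta_cold, w_cold = sdca(X, y, lam, epochs=1)              # baseline

    print("primal objective, warm start:", primal(X, y, lam, w_warm))
    print("primal objective, cold start:", primal(X, y, lam, w_cold))
```

The script prints the primal objective after one pass over the original data with and without the warm start, so the effect of the initialization can be inspected on synthetic data; no particular outcome is claimed here.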

6. Conclusions 

In this paper, we have proposed dual-sparse regularized
randomized reduction methods for classification. We presented
rigorous theoretical analysis of the proposed dual-sparse
randomized reduction methods in terms of recovery
error under a mild condition that the optimal dual variable
is (nearly) sparse for both smooth and non-smooth
loss functions, and for various randomized reduction approaches.
The numerical experiments validate our theoretical
analysis and also demonstrate that the proposed reduction
and recovery framework can benefit distributed optimization
by providing a good initial solution.



Figure 3. Top: Testing error for different methods. Bottom:
Training time for different methods. The value of $\lambda = 10^{-5}$ and
the value of $\tau = 0.9$. The high-dimensional features are reduced
to an $m = 1024$-dimensional space using random hashing. The loss
function is the squared hinge loss.
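A. Proof of Theorem 1

Let $\tilde F(\alpha)$ be defined as

$$\tilde F(\alpha) = \frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) + \frac{1}{2\lambda n^2}\alpha^\top \hat X^\top \hat X\alpha + \frac{\tau}{n}\|\alpha\|_1.$$

Since $\tilde\alpha_* = \arg\min \tilde F(\alpha)$, for any $g_* \in \partial\|\alpha_*\|_1$ we have

$$\begin{aligned}
0 &\ge \tilde F(\tilde\alpha_*) - \tilde F(\alpha_*) \\
&\ge (\tilde\alpha_* - \alpha_*)^\top\Big(\frac{1}{n}\nabla\ell^*(\alpha_*) + \frac{1}{\lambda n^2}\hat X^\top \hat X\alpha_*\Big) + \frac{\tau}{n}(\tilde\alpha_* - \alpha_*)^\top g_* + \frac{1}{2nL}\|\tilde\alpha_* - \alpha_*\|_2^2,
\end{aligned}$$

where we used the strong convexity of $\ell_i^*$ and its strong
convexity modulus $1/L$. By the optimality condition of
$\alpha_*$, we have

$$0 \ge (\alpha_* - \tilde\alpha_*)^\top\Big(\frac{1}{n}\nabla\ell^*(\alpha_*) + \frac{1}{\lambda n^2}X^\top X\alpha_*\Big). \tag{16}$$

Combining the above two inequalities we have

$$0 \ge \frac{1}{n}(\tilde\alpha_* - \alpha_*)^\top\Delta + \frac{\tau}{n}(\tilde\alpha_* - \alpha_*)^\top g_* + \frac{1}{2nL}\|\tilde\alpha_* - \alpha_*\|_2^2.$$

Since the above inequality holds for any $g_* \in \partial\|\alpha_*\|_1$, if
we choose $[g_*]_i = \mathrm{sign}([\tilde\alpha_*]_i)$, $i \in S^c$, then we have

$$(\tilde\alpha_* - \alpha_*)^\top g_* \ge -\|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1 + \|[\tilde\alpha_*]_{S^c}\|_1.$$

Combining the above inequalities leads to

$$(\tau + \|\Delta\|_\infty)\|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1 \ge (\tau - \|\Delta\|_\infty)\|[\tilde\alpha_*]_{S^c}\|_1 + \frac{1}{2L}\|\tilde\alpha_* - \alpha_*\|_2^2. \tag{17}$$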
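Assuming $\tau \ge 2\|\Delta\|_\infty$, we have

$$\|\tilde\alpha_* - \alpha_*\|_2^2 \le 3\tau L\|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1, \qquad \|[\tilde\alpha_*]_{S^c}\|_1 \le 3\|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1. \tag{18}$$

Therefore,

$$\|[\tilde\alpha_* - \alpha_*]_S\|_1^2 \le s\|\tilde\alpha_* - \alpha_*\|_2^2 \le 3\tau L s\|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1,$$

leading to the result

$$\|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1 \le 3\tau L s.$$

Combining this inequality with the inequalities in (18), we have

$$\|[\tilde\alpha_*]_{S^c}\|_1 \le 9\tau L s, \qquad \|\tilde\alpha_* - \alpha_*\|_2 \le 3\tau L\sqrt{s}.$$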
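The constants in the final steps of this proof follow from elementary arithmetic. As a worked check, the derivation below spells out each implication; the shorthand $A$, $B$, $C$ is introduced here only for illustration and is not part of the original notation.

```latex
% Shorthand (introduced here for illustration only):
%   A = \|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1,
%   B = \|[\tilde\alpha_*]_{S^c}\|_1,
%   C = \|\tilde\alpha_* - \alpha_*\|_2.
\begin{align*}
\tau \ge 2\|\Delta\|_\infty
  &\;\Longrightarrow\;
  \tau + \|\Delta\|_\infty \le \tfrac{3}{2}\tau,
  \quad \tau - \|\Delta\|_\infty \ge \tfrac{1}{2}\tau, \\
(\tau + \|\Delta\|_\infty) A \ge (\tau - \|\Delta\|_\infty) B + \tfrac{1}{2L} C^2
  &\;\Longrightarrow\;
  \tfrac{3}{2}\tau A \ge \tfrac{1}{2}\tau B + \tfrac{1}{2L} C^2
  \;\Longrightarrow\;
  C^2 \le 3\tau L A, \quad B \le 3A, \\
A^2 \le s\, C^2 \le 3\tau L s\, A
  &\;\Longrightarrow\;
  A \le 3\tau L s, \\
B \le 3A \le 9\tau L s,
  &\qquad
  C \le \sqrt{3\tau L \cdot 3\tau L s} = 3\tau L \sqrt{s}.
\end{align*}
```

The third line uses the Cauchy-Schwarz inequality on the $s$ coordinates in $S$, i.e., $A \le \sqrt{s}\,\|[\tilde\alpha_* - \alpha_*]_S\|_2 \le \sqrt{s}\,C$.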

B. Proof of Theorem 3 

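Following the same proof of Theorem 1, we first notice that
inequality (17) holds for $L = \infty$, i.e.,

$$(\tau + \|\Delta\|_\infty)\|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1 \ge (\tau - \|\Delta\|_\infty)\|[\tilde\alpha_*]_{S^c}\|_1.$$

Therefore if $\tau \ge 2\|\Delta\|_\infty$, we have

$$\|[\tilde\alpha_*]_{S^c}\|_1 \le 3\|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1.$$

As a result,

$$\frac{\|\tilde\alpha_* - \alpha_*\|_1}{\|\tilde\alpha_* - \alpha_*\|_2} \le \frac{\|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1 + \|[\tilde\alpha_*]_{S^c}\|_1}{\|\tilde\alpha_* - \alpha_*\|_2} \le \frac{4\|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1}{\|\tilde\alpha_* - \alpha_*\|_2} \le 4\sqrt{s}.$$

By the definition of $\mathcal{K}_{n,s}$, we have

$$\frac{\tilde\alpha_* - \alpha_*}{\|\tilde\alpha_* - \alpha_*\|_2} \in \mathcal{K}_{n,16s}.$$

To proceed with the proof, note that there exists $\tilde g_* \in \partial\|\tilde\alpha_*\|_1$ such that

$$0 \ge (\tilde\alpha_* - \alpha_*)^\top\Big(\frac{1}{n}\nabla\ell^*(\tilde\alpha_*) + \frac{1}{\lambda n^2}\hat X^\top \hat X\tilde\alpha_*\Big) + \frac{\tau}{n}(\tilde\alpha_* - \alpha_*)^\top \tilde g_*.$$

Adding the above inequality to (16), we have

$$\begin{aligned}
0 \ge{}& (\alpha_* - \tilde\alpha_*)^\top\Big(\frac{1}{n}\nabla\ell^*(\alpha_*) - \frac{1}{n}\nabla\ell^*(\tilde\alpha_*)\Big) \\
&+ (\tilde\alpha_* - \alpha_*)^\top\Big(\frac{1}{\lambda n^2}\hat X^\top \hat X\tilde\alpha_* - \frac{1}{\lambda n^2}X^\top X\alpha_*\Big) \\
&+ \frac{\tau}{n}\|[\tilde\alpha_*]_{S^c}\|_1 - \frac{\tau}{n}\|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1.
\end{aligned}$$

By convexity of $\ell^*$ we have

$$(\alpha_* - \tilde\alpha_*)^\top\Big(\frac{1}{n}\nabla\ell^*(\alpha_*) - \frac{1}{n}\nabla\ell^*(\tilde\alpha_*)\Big) \ge 0.$$

Thus, we have

$$\begin{aligned}
\tau\|[\tilde\alpha_*]_S - [\alpha_*]_S\|_1 \ge{}& \tau\|[\tilde\alpha_*]_{S^c}\|_1
+ (\tilde\alpha_* - \alpha_*)^\top\Big(\frac{1}{\lambda n}\hat X^\top \hat X - \frac{1}{\lambda n}X^\top X\Big)\alpha_* \\
&- (\tilde\alpha_* - \alpha_*)^\top\Big(\frac{1}{\lambda n}X^\top X - \frac{1}{\lambda n}\hat X^\top \hat X\Big)(\tilde\alpha_* - \alpha_*) \\
&+ \frac{1}{\lambda n}(\tilde\alpha_* - \alpha_*)^\top X^\top X(\tilde\alpha_* - \alpha_*).
\end{aligned}$$

Since

$$(\tilde\alpha_* - \alpha_*)^\top\Delta \ge -\|\Delta\|_\infty\|\tilde\alpha_* - \alpha_*\|_1,$$

and $\tau \ge 2\|\Delta\|_\infty$, and by the definition of $\rho_s^-$ and $\sigma_s$, we have

$$\frac{3\tau}{2}\|[\tilde\alpha_* - \alpha_*]_S\|_1 \ge \frac{\tau}{2}\|[\tilde\alpha_*]_{S^c}\|_1 + \frac{\rho^-_{16s} - \sigma_{16s}}{\lambda}\|\tilde\alpha_* - \alpha_*\|_2^2.$$

Then the conclusion follows the same analysis as before.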
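The step from the cone condition to membership in $\mathcal{K}_{n,16s}$ can be checked numerically. The sketch below assumes the definition $\mathcal{K}_{n,s} = \{\alpha : \|\alpha\|_2 \le 1, \|\alpha\|_1 \le \sqrt{s}\}$ (an assumption here, since the definition appears outside this appendix); the helper `in_K` and the synthetic vector are hypothetical and serve only as an illustration.

```python
import numpy as np


def in_K(v, s, tol=1e-12):
    """Check ||v||_2 <= 1 and ||v||_1 <= sqrt(s), up to a small tolerance."""
    return bool(np.linalg.norm(v) <= 1 + tol and np.abs(v).sum() <= np.sqrt(s) + tol)


rng = np.random.default_rng(0)
n, s = 50, 5

# Build a vector on the boundary of the cone: ||v_{S^c}||_1 = 3 * ||v_S||_1,
# where S is taken to be the first s coordinates.
v = np.zeros(n)
v[:s] = rng.standard_normal(s)
tail = np.abs(rng.standard_normal(n - s))
v[s:] = tail / tail.sum() * 3 * np.abs(v[:s]).sum()

u = v / np.linalg.norm(v)     # normalized direction, as in the proof
print(in_K(u, 16 * s))        # the argument above guarantees membership
```

The guarantee follows from $\|v\|_1 = 4\|v_S\|_1 \le 4\sqrt{s}\|v_S\|_2 \le \sqrt{16 s}\,\|v\|_2$, which is the same chain of inequalities used in the proof.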


C. Proof of Theorem 4 

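Let $\tilde F(\alpha)$ be defined as

$$\tilde F(\alpha) = \frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) + \frac{1}{2\lambda n^2}\alpha^\top \hat X^\top \hat X\alpha + \frac{\tau}{n}\|\alpha\|_1$$

and $F(\alpha)$ be defined as

$$F(\alpha) = \frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) + \frac{1}{2\lambda n^2}\alpha^\top X^\top X\alpha.$$

Since $\tilde\alpha_* = \arg\min \tilde F(\alpha)$, for any $g_* \in \partial\|\alpha_*^s\|_1$ we have

$$\begin{aligned}
0 &\ge \tilde F(\tilde\alpha_*) - \tilde F(\alpha_*^s) \\
&\ge (\tilde\alpha_* - \alpha_*^s)^\top\Big(\frac{1}{n}\nabla\ell^*(\alpha_*^s) + \frac{1}{\lambda n^2}\hat X^\top \hat X\alpha_*^s\Big) + \frac{\tau}{n}(\tilde\alpha_* - \alpha_*^s)^\top g_* + \frac{1}{2nL}\|\tilde\alpha_* - \alpha_*^s\|_2^2,
\end{aligned}$$

where we used the strong convexity of $\ell_i^*$ and its strong
convexity modulus $1/L$. Due to the sub-optimality of $\alpha_*^s$,
we have

$$\frac{1}{n}\|\alpha_*^s - \tilde\alpha_*\|_1\,\xi \ge (\alpha_*^s - \tilde\alpha_*)^\top\Big(\frac{1}{n}\nabla\ell^*(\alpha_*^s) + \frac{1}{\lambda n^2}X^\top X\alpha_*^s\Big).$$

Combining the above two inequalities we have

$$\frac{1}{n}\|\alpha_*^s - \tilde\alpha_*\|_1\,\xi \ge (\tilde\alpha_* - \alpha_*^s)^\top\frac{1}{\lambda n^2}\big(\hat X^\top \hat X - X^\top X\big)\alpha_*^s + \frac{\tau}{n}(\tilde\alpha_* - \alpha_*^s)^\top g_* + \frac{1}{2nL}\|\tilde\alpha_* - \alpha_*^s\|_2^2.$$

Since the above inequality holds for any $g_* \in \partial\|\alpha_*^s\|_1$, if
we choose $[g_*]_i = \mathrm{sign}([\tilde\alpha_*]_i)$, $i \in S^c$, then we have

$$(\xi + \tau)\|[\tilde\alpha_*]_S - [\alpha_*^s]_S\|_1 \ge -\|\Delta\|_\infty\|\tilde\alpha_* - \alpha_*^s\|_1 + (\tau - \xi)\|[\tilde\alpha_*]_{S^c}\|_1 + \frac{1}{2L}\|\tilde\alpha_* - \alpha_*^s\|_2^2. \tag{19}$$

Assuming $\tau \ge 2(\|\Delta\|_\infty + \xi)$, we have

$$\|\tilde\alpha_* - \alpha_*^s\|_2^2 \le 3\tau L\|[\tilde\alpha_*]_S - [\alpha_*^s]_S\|_1, \qquad \|[\tilde\alpha_*]_{S^c}\|_1 \le 3\|[\tilde\alpha_*]_S - [\alpha_*^s]_S\|_1.$$

Therefore,

$$\|[\tilde\alpha_* - \alpha_*^s]_S\|_1^2 \le s\|\tilde\alpha_* - \alpha_*^s\|_2^2 \le 3\tau L s\|[\tilde\alpha_*]_S - [\alpha_*^s]_S\|_1,$$

leading to the result

$$\|[\tilde\alpha_*]_S - [\alpha_*^s]_S\|_1 \le 3\tau L s.$$

Combining the above inequality with the two inequalities preceding it (the analogues of (18)), we have

$$\|[\tilde\alpha_*]_{S^c}\|_1 \le 9\tau L s, \qquad \|\tilde\alpha_* - \alpha_*^s\|_2 \le 3\tau L\sqrt{s}.$$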

D. Proof of Theorem 7 

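Recall the definition of $\mathcal{S}_{n,s}$:

$$\mathcal{S}_{n,s} = \{\alpha \in \mathbb{R}^n : \|\alpha\|_2 \le 1, \|\alpha\|_0 \le s\}.$$

Due to $\mathrm{conv}(\mathcal{S}_{n,s}) \subset \mathcal{K}_{n,s} \subset 2\,\mathrm{conv}(\mathcal{S}_{n,s})$, any $\alpha \in \mathcal{K}_{n,s}$
can be written as $\alpha = 2\sum_i \lambda_i\beta_i$, where $\beta_i \in \mathcal{S}_{n,s}$,
$\sum_i \lambda_i = 1$ and $\lambda_i \ge 0$. Then we have

$$\begin{aligned}
|(X\alpha)^\top(I - A^\top A)(X\alpha)|
&\le 4\Big|\Big(X\sum_i \lambda_i\beta_i\Big)^\top(I - A^\top A)\Big(X\sum_j \lambda_j\beta_j\Big)\Big| \\
&\le 4\sum_{i,j}\lambda_i\lambda_j\,|(X\beta_i)^\top(I - A^\top A)(X\beta_j)| \\
&\le 4\max_{\alpha_1,\alpha_2 \in \mathcal{S}_{n,s}}|(X\alpha_1)^\top(I - A^\top A)(X\alpha_2)|\sum_{i,j}\lambda_i\lambda_j \\
&= 4\max_{\alpha_1,\alpha_2 \in \mathcal{S}_{n,s}}|(X\alpha_1)^\top(I - A^\top A)(X\alpha_2)|.
\end{aligned}$$

Therefore

$$\max_{\alpha \in \mathcal{K}_{n,s}}|(X\alpha)^\top(I - A^\top A)(X\alpha)| \le 4\max_{\alpha_1,\alpha_2 \in \mathcal{S}_{n,s}}|(X\alpha_1)^\top(I - A^\top A)(X\alpha_2)|.$$

Let $u_1 = X\alpha_1$ and $u_2 = X\alpha_2$. Following the Proof of
Theorem 5, for any fixed $\alpha_1, \alpha_2 \in \mathcal{S}_{n,s}$, with a probability
$1 - 2\delta$ we have

$$\frac{1}{n}|(X\alpha_1)^\top(I - A^\top A)(X\alpha_2)| \le \frac{1}{n}\|X\alpha_1\|_2\|X\alpha_2\|_2\,\epsilon_{A,\delta} \le \rho_s^+\epsilon_{A,\delta} \le O\Big(\rho_s^+\sqrt{\frac{\log(1/\delta)}{m}}\Big),$$

where we use

$$\max_{\alpha \in \mathcal{S}_{n,s}}\frac{\|X\alpha\|_2}{\sqrt{n}} \le \max_{\alpha \in \mathcal{K}_{n,s}}\frac{\|X\alpha\|_2}{\sqrt{n}} \le \sqrt{\rho_s^+}.$$

In order to extend the inequality to all $\alpha_1, \alpha_2 \in \mathcal{S}_{n,s}$, we
consider the $\epsilon$ proper-net of $\mathcal{S}_{n,s}$ (Plan & Vershynin, 2011),
denoted by $\mathcal{S}_{n,s}(\epsilon)$. Lemma 3.3 in (Plan & Vershynin,
2011) shows that the entropy of $\mathcal{S}_{n,s}$, i.e., the logarithm of the cardinality
of $\mathcal{S}_{n,s}(\epsilon)$, denoted $N(\mathcal{S}_{n,s}, \epsilon)$, is bounded by

$$\log N(\mathcal{S}_{n,s}, \epsilon) \le s\log\Big(\frac{9n}{\epsilon s}\Big).$$

Then by using the union bound, with a probability
$1 - 2\delta$, we have

$$\begin{aligned}
\max_{\alpha_1,\alpha_2 \in \mathcal{S}_{n,s}(\epsilon)}\frac{1}{n}|(X\alpha_1)^\top(I - A^\top A)(X\alpha_2)|
&\le O\Big(\rho_s^+\sqrt{\frac{\log(N^2(\mathcal{S}_{n,s}, \epsilon)/\delta)}{m}}\Big) \\
&\le O\Big(\rho_s^+\sqrt{\frac{\log(1/\delta) + 2s\log(9n/\epsilon s)}{m}}\Big).
\end{aligned} \tag{22}$$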


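To proceed with the proof, we need the following lemma.

Lemma 5. Let

$$\mathcal{E}_s(\alpha_2) = \max_{\alpha_1 \in \mathcal{S}_{n,s}}|\alpha_1^\top U\alpha_2|, \qquad \mathcal{E}_s(\alpha_2, \epsilon) = \max_{\alpha_1 \in \mathcal{S}_{n,s}(\epsilon)}|\alpha_1^\top U\alpha_2|.$$

For $\epsilon \in (0, 1/\sqrt{2})$, we have

$$\mathcal{E}_s(\alpha_2) \le \frac{1}{1 - \sqrt{2}\epsilon}\,\mathcal{E}_s(\alpha_2, \epsilon).$$

Proof. Let $U = \frac{1}{n}X^\top(I - A^\top A)X$. Following Lemma
9.2 of (Koltchinskii, 2011), for any $\alpha, \alpha' \in \mathcal{S}_{n,s}$, we can
always find two vectors $\beta, \beta'$ such that

$$\alpha - \alpha' = \beta - \beta', \quad \|\beta\|_0 \le s, \quad \|\beta'\|_0 \le s, \quad \beta^\top\beta' = 0.$$

Thus

$$\begin{aligned}
|\langle\alpha - \alpha', U\alpha_2\rangle| &\le |\langle\beta, U\alpha_2\rangle| + |\langle-\beta', U\alpha_2\rangle| \\
&= \|\beta\|_2\Big|\Big\langle\frac{\beta}{\|\beta\|_2}, U\alpha_2\Big\rangle\Big| + \|\beta'\|_2\Big|\Big\langle\frac{\beta'}{\|\beta'\|_2}, U\alpha_2\Big\rangle\Big| \\
&\le (\|\beta\|_2 + \|\beta'\|_2)\,\mathcal{E}_s(\alpha_2) \le \mathcal{E}_s(\alpha_2)\sqrt{2}\sqrt{\|\beta\|_2^2 + \|\beta'\|_2^2} \\
&= \mathcal{E}_s(\alpha_2)\sqrt{2}\,\|\beta - \beta'\|_2 = \mathcal{E}_s(\alpha_2)\sqrt{2}\,\|\alpha - \alpha'\|_2.
\end{aligned}$$

Then, we have

$$\begin{aligned}
\mathcal{E}_s(\alpha_2) = \max_{\alpha \in \mathcal{S}_{n,s}}|\alpha^\top U\alpha_2|
&\le \max_{\alpha \in \mathcal{S}_{n,s}(\epsilon)}|\alpha^\top U\alpha_2| + \sup_{\substack{\alpha \in \mathcal{S}_{n,s},\ \alpha' \in \mathcal{S}_{n,s}(\epsilon) \\ \|\alpha - \alpha'\|_2 \le \epsilon}}|\langle\alpha - \alpha', U\alpha_2\rangle| \\
&\le \mathcal{E}_s(\alpha_2, \epsilon) + \sqrt{2}\epsilon\,\mathcal{E}_s(\alpha_2),
\end{aligned}$$

which implies

$$\mathcal{E}_s(\alpha_2) \le \frac{\mathcal{E}_s(\alpha_2, \epsilon)}{1 - \sqrt{2}\epsilon}.$$

□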
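The decomposition $\alpha - \alpha' = \beta - \beta'$ used in the proof can be made concrete. The sketch below shows one simple construction, which is an assumption for illustration and not necessarily the one in Koltchinskii's Lemma 9.2: the difference of two $s$-sparse vectors has at most $2s$ nonzero entries, so its support can be split into two disjoint halves of size at most $s$. The function name `split_difference` is hypothetical.

```python
import numpy as np


def split_difference(alpha, alpha_prime, s):
    """Return (beta, beta_prime) with alpha - alpha_prime = beta - beta_prime.

    Both outputs are at most s-sparse and have disjoint supports, hence are
    orthogonal. Assumes alpha and alpha_prime are each at most s-sparse.
    """
    diff = alpha - alpha_prime
    support = np.flatnonzero(diff)            # at most 2s indices
    first, second = support[:s], support[s:]  # two disjoint pieces, each <= s
    beta = np.zeros_like(diff)
    beta_prime = np.zeros_like(diff)
    beta[first] = diff[first]
    beta_prime[second] = -diff[second]
    return beta, beta_prime


n, s = 10, 3
alpha = np.zeros(n)
alpha[[0, 1, 2]] = [0.5, -0.3, 0.2]
alpha_prime = np.zeros(n)
alpha_prime[[3, 4, 5]] = [0.1, 0.4, -0.6]

beta, beta_prime = split_difference(alpha, alpha_prime, s)
lhs = np.linalg.norm(beta) + np.linalg.norm(beta_prime)
rhs = np.sqrt(2) * np.linalg.norm(alpha - alpha_prime)
print(lhs <= rhs)   # the norm inequality used in the proof
```

Because the supports are disjoint, $\|\beta\|_2^2 + \|\beta'\|_2^2 = \|\beta - \beta'\|_2^2$, which is exactly the identity the proof relies on.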


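Lemma 6. Let

$$\mathcal{E}_s(\epsilon) = \max_{\alpha_2 \in \mathcal{S}_{n,s}}\mathcal{E}_s(\alpha_2, \epsilon) = \max_{\alpha_1 \in \mathcal{S}_{n,s}(\epsilon),\ \alpha_2 \in \mathcal{S}_{n,s}}|\alpha_1^\top U\alpha_2|,$$

$$\mathcal{E}_s(\epsilon, \epsilon) = \max_{\alpha_2 \in \mathcal{S}_{n,s}(\epsilon)}\mathcal{E}_s(\alpha_2, \epsilon) = \max_{\alpha_1,\alpha_2 \in \mathcal{S}_{n,s}(\epsilon)}|\alpha_1^\top U\alpha_2|.$$

For $\epsilon \in (0, 1/\sqrt{2})$, we have

$$\mathcal{E}_s(\epsilon) \le \frac{1}{1 - \sqrt{2}\epsilon}\,\mathcal{E}_s(\epsilon, \epsilon).$$

The proof of the above lemma follows the same analysis as
that of Lemma 5. By combining Lemma 5 and Lemma 6,
we have

$$\begin{aligned}
\sigma_s = \max_{\alpha_2 \in \mathcal{S}_{n,s}}\mathcal{E}_s(\alpha_2)
&\le \frac{1}{1 - \sqrt{2}\epsilon}\max_{\alpha_2 \in \mathcal{S}_{n,s}}\mathcal{E}_s(\alpha_2, \epsilon)
= \frac{1}{1 - \sqrt{2}\epsilon}\,\mathcal{E}_s(\epsilon) \\
&\le \Big(\frac{1}{1 - \sqrt{2}\epsilon}\Big)^2\mathcal{E}_s(\epsilon, \epsilon)
= \Big(\frac{1}{1 - \sqrt{2}\epsilon}\Big)^2\max_{\alpha_1,\alpha_2 \in \mathcal{S}_{n,s}(\epsilon)}|\alpha_1^\top U\alpha_2|.
\end{aligned}$$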
By combining the above inequality with inequality (22), we
have

$$\sigma_s \le \Big(\frac{1}{1 - \sqrt{2}\epsilon}\Big)^2 O\Big(\rho_s^+\sqrt{\frac{\log(1/\delta) + 2s\log(9n/\epsilon s)}{m}}\Big).$$
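To get a feel for the final bound, the sketch below evaluates its two ingredients numerically: the net factor $(1/(1-\sqrt{2}\epsilon))^2$ and the rate $\sqrt{(\log(1/\delta) + 2s\log(9n/\epsilon s))/m}$. The constant hidden in $O(\cdot)$ is not specified by the analysis, so the parameter `c` (default 1) and the default `rho_plus` are placeholders; the function names are hypothetical.

```python
import math


def net_factor(eps):
    """(1 / (1 - sqrt(2) * eps))**2: the price of passing from the net to S_{n,s}."""
    if not 0.0 < eps < 1.0 / math.sqrt(2.0):
        raise ValueError("eps must lie in (0, 1/sqrt(2))")
    return (1.0 / (1.0 - math.sqrt(2.0) * eps)) ** 2


def sigma_bound_rate(n, s, m, delta, eps, rho_plus=1.0, c=1.0):
    """Evaluate the bound on sigma_s up to the unspecified constant c."""
    log_term = math.log(1.0 / delta) + 2.0 * s * math.log(9.0 * n / (eps * s))
    return net_factor(eps) * c * rho_plus * math.sqrt(log_term / m)


eps = 1.0 / (2.0 * math.sqrt(2.0))   # sqrt(2) * eps = 1/2, so the net factor is 4
for m in (1024, 4096):
    rate = sigma_bound_rate(n=10_000, s=20, m=m, delta=0.05, eps=eps)
    print(m, rate)
```

Quadrupling $m$ halves the rate, reflecting the $1/\sqrt{m}$ dependence, while $s$ enters only through the $s\log(n/s)$ term.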










